Linear regression is a useful tool for predicting a quantitative response
It serves as a good jumping-off point for newer approaches because many more sophisticated statistical learning methods can be seen as generalizations or extensions of linear regression
Consider the advertising data shown above
Questions that we might ask:
Is there a relationship between advertising budget and sales?
How strong is the relationship between advertising budget and sales?
Which media contribute to sales?
How accurately can we predict future sales?
Is the relationship linear?
Is there synergy among the advertising media?
How could we answer these questions with linear regression?
A very straightforward approach for predicting a quantitative response \(Y\) on the basis of a single predictor variable \(X\)
The primary assumption that we make is that there is an approximately linear relationship between \(X\) and \(Y\)
We assume a model: \(Y = B_0 + B_1X + \epsilon\)
where \(B_0\) and \(B_1\) are two unknown constants that represent the intercept and slope, also known as coefficients or parameters, and \(\epsilon\) is the error term
We may describe the above equation by saying that we are regressing \(Y\) on \(X\) (or \(Y\) onto \(X\))
Once we have utilized our training data to produce estimates \(\hat{B_0}\) and \(\hat{B_1}\) for the model coefficients, we could predict future sales using: \(\hat{y} = \hat{B_0} + \hat{B_1}x\)
where \(\hat{y}\) indicates a prediction of \(Y\) on the basis of \(X = x\)
Our goal is to obtain coefficient estimates \(\hat{B_0}\) and \(\hat{B_1}\) such that our linear model fits the available data well
We want to find an intercept \(\hat{B_0}\) and a slope \(\hat{B_1}\) such that the resulting line is as close as possible to the \(n\) data points
There are a number of ways of measuring closeness. However, by far the most common approach involves minimizing the least squares criterion
Let \(\hat{y_i} = \hat{B_0} + \hat{B_1}x_i\) be the prediction for \(Y\) based on the \(i\)th value of \(X\)
Then \(e_i = y_i - \hat{y_i}\) represents the \(i\)th residual
We define the residual sum of squares (RSS) as: \(RSS = e_1^2 + e_2^2 + ... + e_n^2\)
The least squares approach chooses \(\hat{B_0}\) and \(\hat{B_1}\) to minimize the RSS
Let's see this at work on the advertising data
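As a minimal sketch (assuming Advertising.csv is in the working directory, as in the chunk further below; the object names adv, b0_hat, and b1_hat are just for illustration), the least squares estimates for a regression of sales on TV can be computed directly and checked against lm():
# sketch: closed-form least squares estimates for simple linear regression
library(readr)
adv <- read_csv("Advertising.csv")

x <- adv$TV
y <- adv$sales

b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0_hat <- mean(y) - b1_hat * mean(x)                                 # intercept
c(b0_hat, b1_hat)

coef(lm(sales ~ TV, data = adv))  # should match the hand-computed estimates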
Recall that we assume the true relationship between \(X\) and \(Y\) takes the form \(Y = f(X) + \epsilon\); if \(f\) is approximated by a linear function, then we can write the relationship as \(Y = B_0 + B_1X + \epsilon\)
Where \(B_0\) is the intercept term
And \(B_1\) is the slope
\(\epsilon\) is a catch-all for what we miss with our model
\(Y = B_0 + B_1X + \epsilon\) defines the population regression line, which is the best linear approximation to the true relationship between \(X\) and \(Y\)
The least squares regression coefficient estimates characterize the least square line
Bias
On the basis of one particular set of observations \(y_1, ..., y_n\), an estimate might overestimate the parameter of interest, while on the basis of another set of observations it might underestimate it
An unbiased estimator does not systematically over- or underestimate the true parameter
The property of unbiasedness holds for the least squares coefficient estimates (Best Linear Unbiased Estimate or BLUE)
One potential thought you might have concerns the accuracy of our sample mean \(\hat{\mu}\) as an estimate of \(\mu\)
We have established that the average of \(\hat{\mu}\)'s over many data sets will be very close to \(\mu\), but that a single estimate \(\hat{\mu}\) may be a substantial underestimate or overestimate of \(\mu\). How far off will that single estimate \(\hat{\mu}\) be?
Standard Error
The standard error of an estimator reflects how it varies under repeated sampling
The standard error of \(\hat{\mu}\) can be written as \(SE(\hat{\mu})\)
\(Var(\hat{\mu}) = SE(\hat{\mu})^2 = \frac{\sigma^2}{n}\)
where \(\sigma\) is the standard deviation of each of the realizations \(y_i\) of \(Y\)
Notice how the standard error shrinks with \(n\): the more observations we have, the smaller the standard error
The standard error tells us the average amount that our estimate \(\hat{\mu}\) differs from the actual value of \(\mu\)
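A quick simulation illustrates this (a sketch with arbitrary choices of \(\mu = 5\), \(\sigma = 2\), and \(n = 100\); the object names are just for illustration):
# sketch: standard error of the sample mean under repeated sampling
set.seed(1)
mu <- 5; sigma <- 2; n <- 100

# draw many data sets of size n and compute the sample mean of each
mu_hats <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))

sd(mu_hats)        # empirical standard deviation of mu_hat across samples
sigma / sqrt(n)    # theoretical SE(mu_hat) = sigma / sqrt(n) = 0.2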
These standard errors can be used to compute confidence intervals.
Confidence Intervals
A 95% confidence interval is defined as a range of values such that, with 95% probability, the range will contain the true unknown value of the parameter. It has the form: \(\hat{B_1} \pm 2 \cdot SE(\hat{B_1})\)
That is, there is approximately a 95% chance that the interval \([\hat{B_1} - 2 \cdot SE(\hat{B_1}), \hat{B_1} + 2 \cdot SE(\hat{B_1})]\) will contain the true value of \(B_1\)
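As a sketch on the advertising data (assuming Advertising.csv is available; SLR is just an illustrative object name), confint() returns these intervals, and the rough \(\pm 2 \cdot SE\) interval can be checked by hand:
# sketch: 95% confidence intervals for the coefficients of sales ~ TV
adv <- readr::read_csv("Advertising.csv")
SLR <- lm(sales ~ TV, data = adv)

confint(SLR, level = 0.95)

# approximate interval by hand: estimate +/- 2 * standard error
est <- coef(summary(SLR))["TV", "Estimate"]
se  <- coef(summary(SLR))["TV", "Std. Error"]
est + c(-2, 2) * se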
Standard errors can also be used to perform hypothesis tests on the coefficients
The most common hypothesis test involves testing the null hypothesis of
\(H_0\): There is no relationship between \(X\) and \(Y\)
\(H_A\): There is some relationship between \(X\) and \(Y\)
\(H_0\): \(B_1\) = 0
\(H_A\): \(B_1\) \(\neq\) 0
To test the null hypothesis, we compute a t-statistic given by: \(t = \frac{\hat{B_1}-0}{SE(\hat{B_1})}\)
This will have a \(t\)-distribution with n - 2 degrees of freedom (assuming \(B_1\) = 0)
Using statistical software (e.g., R), it is easy to compute the probability of observing any value equal to \(|t|\) or larger
We call this probability the p-value
We interpret the p-value as follows: a small p-value indicates that it is unlikely to observe such a substantial association between the predictor and the response due to chance, in the absence of any real association between the predictor and the response
Hence, if we see a small p-value, then we can infer that there is an association between the predictor and the response
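As a sketch (again on the advertising data, with illustrative object names), the t-statistic and p-value reported by summary() can be reproduced directly:
# sketch: t-statistic and two-sided p-value for H0: B1 = 0 in sales ~ TV
adv <- readr::read_csv("Advertising.csv")
SLR <- lm(sales ~ TV, data = adv)

est <- coef(summary(SLR))["TV", "Estimate"]
se  <- coef(summary(SLR))["TV", "Std. Error"]
t_stat <- (est - 0) / se

n <- nrow(adv)
2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)  # p-value from a t-distribution with n - 2 df

coef(summary(SLR))["TV", c("t value", "Pr(>|t|)")]   # compare with the summary() output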
In order to determine the accuracy of our model, we can compute either the residual standard error or \(R^2\)
Residual Standard Error (RSE)
The RSE is an estimate of the standard deviation of \(\epsilon\). It is the average amount that the response will deviate from the true regression line
\(RSE = \sqrt{\frac{1}{n-2}RSS}=\sqrt{\frac{1}{n-2}\sum_{i=1}^{n}(y_i-\hat{y_i})^2}\)
The RSE provides an absolute measure of lack of fit of the model to the data. But since it is measured in units of \(Y\), it is not always clear what constitutes a good RSE
\(R^2\) (R-Squared Statistic)
Provides an alternative measure of fit
Takes the form of a proportion - the proportion of variance explained - which means it has a value between 0 and 1 and is independent of the scale of Y
\(R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS}\)
Where \(TSS = \sum(y_i - \bar{y})^2\) is the total sum of squares
And \(RSS = \sum_{i=1}^{n}(y_i-\hat{y_i})^2\)
Therefore, TSS - RSS measures the amount of variability in the response that is explained by performing the regression
\(R^2\) measures the proportion of variability in \(Y\) that can be explained using \(X\)
An \(R^2\) statistic that is close to 1 indicates that a large proportion of the variability in the response is explained by the regression. Alternatively, a number near 0 indicates that the regression does not explain much of the variability in the response
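As a sketch, both quantities can be computed from the residuals of the simple regression of sales on TV and checked against summary() (illustrative object names again):
# sketch: RSE and R^2 computed by hand for sales ~ TV
adv <- readr::read_csv("Advertising.csv")
SLR <- lm(sales ~ TV, data = adv)

n   <- nrow(adv)
rss <- sum(residuals(SLR)^2)                   # residual sum of squares
tss <- sum((adv$sales - mean(adv$sales))^2)    # total sum of squares

sqrt(rss / (n - 2))   # RSE; compare with summary(SLR)$sigma
1 - rss / tss         # R^2; compare with summary(SLR)$r.squared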
Simple linear regression is a useful approach for predicting a response on the basis of a single predictor variable
However, in practice we often have more than one predictor
How can we extend our analysis of the data in order to accommodate two additional predictors?
Let’s extend the simple linear regression model so that it can accommodate multiple predictors
We can achieve this by giving each predictor a separate slope coefficient in a single model
Then the multiple linear regression model takes the form: \(Y = B_0 + B_1X_1 + B_2X_2 + ... + B_pX_p + \epsilon\)
We interpret \(B_j\) as the average effect on \(Y\) of a one-unit increase in \(X_j\), holding all other predictors fixed
As was the case in the simple linear regression setting, the regression coefficients are unknown and must be estimated
Given estimates \(\hat{B_0}, \hat{B_1},...,\hat{B_p}\), we can make predictions using the formula: \(\hat{y} = \hat{B_0} + \hat{B_1}x_1 + \hat{B_2}x_2 + ... + \hat{B_p}x_p\)
The parameters are estimated using the same least squares approach that we saw in the context of simple linear regression, where we choose \(\hat{B_0}, \hat{B_1},...,\hat{B_p}\) to minimize the sum of squared residuals
\(RSS = \sum_{i=1}^{n}(y_i-\hat{y_i})^2\)
\(= \sum_{i=1}^{n}(y_i-\hat{B_0}-\hat{B_1}x_{i1}-\hat{B_2}x_{i2}-...-\hat{B_p}x_{ip})^2\)
Results for advertising data
TV, radio, and newspaper advertising budgets are used to predict product sales using the advertising data
Spending an additional $1,000 on TV advertising is associated with approximately 46 units of additional sales, holding all other variables constant
Spending an additional $1,000 on radio advertising is associated with approximately 189 units of additional sales, holding all other variables constant
The relationship between newspaper advertising and sales is not statistically significant
library(tidyverse)
data <- read_csv("Advertising.csv")   # Advertising data: TV, radio, newspaper, sales
# multiple regression of sales on all three media budgets
MLR <- lm(sales ~ TV + radio + newspaper, data = data)
summary(MLR)
1). Is at least one of the predictors \(X_1, X_2, ... , X_p\), useful in predicting the response?
Recall that in the simple linear regression setting, in order to determine whether there is a relationship between the response and the predictor we can simply check whether \(B_1\) = 0. In the multiple regression setting with \(p\) predictors, we need to ask whether all of the regression coefficients are zero. We can utilize a hypothesis test to answer this question
\(H_0\): \(B_1 = B_2 = ... = B_p = 0\)
\(H_A\): at least one \(B_j\) is non-zero
We perform this hypothesis test by computing the F-statistic: \(F = \frac{(TSS - RSS) / p}{RSS / (n - p - 1)}\)
When there is no relationship between the response and predictors, one would expect the F-statistic to take on a value close to 1
On the other hand, if \(H_A\) is true, then we expect F to be greater than 1
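As a sketch, the F-statistic can be computed by hand for the multiple regression on all three media and checked against the value printed by summary():
# sketch: overall F-statistic for H0: B1 = B2 = B3 = 0
adv <- readr::read_csv("Advertising.csv")
fit <- lm(sales ~ TV + radio + newspaper, data = adv)

n   <- nrow(adv)
p   <- 3
rss <- sum(residuals(fit)^2)
tss <- sum((adv$sales - mean(adv$sales))^2)

((tss - rss) / p) / (rss / (n - p - 1))   # compare with summary(fit)$fstatistic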
2). Do all the predictors help to explain \(Y\), or is it only a subset of the predictors that are useful?
The task of determining which predictors are associated with the response, in order to fit a single model involving only those predictors, is referred to as variable selection
We can estimate multiple candidate models with the predictors that we have (e.g., a model with no predictors, a model with \(X_1\), a model with \(X_2\), and a model with both \(X_1\) and \(X_2\))
From there we can examine the AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion), or adjusted \(R^2\) to determine which variables are useful to the model; a short sketch follows the selection strategies listed below
Forward Selection
Backward Selection
Mixed Selection
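As a sketch (one of several reasonable ways to do this), candidate models can be compared with AIC() and BIC(), and base R's step() performs forward selection from the null model:
# sketch: comparing candidate models and a simple forward selection
adv <- readr::read_csv("Advertising.csv")

m0 <- lm(sales ~ 1, data = adv)                       # no predictors
m1 <- lm(sales ~ TV, data = adv)
m2 <- lm(sales ~ TV + radio, data = adv)
m3 <- lm(sales ~ TV + radio + newspaper, data = adv)  # all predictors

AIC(m0, m1, m2, m3)   # lower is better
BIC(m0, m1, m2, m3)

step(m0, scope = formula(m3), direction = "forward")  # forward selection toward the full model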
3). How well does the model fit the data?
Two of the most common numerical measures of model fit are the RSE and \(R^2\)
Recall, \(R^2\) will always increase when more variables are added to the model, even if those variables are only weakly associated with the response
4). Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
Once we have fit the multiple regression model, it is straightforward to predict the response \(Y\) on the basis of a set of values for the predictors \(X_1, X_2,...,X_p\). There are three sorts of uncertainty that will be associated with these predictions:
Our least squares estimates are only estimates for the true population
This inaccuracy in the coefficient estimates is related to the reducible error
Of course, in practice assuming a linear model for \(f(X)\) is almost always an approximation of reality, so there is an additional source of potentially reducible error which we call model bias. So when we use a linear model, we are in fact estimating the best linear approximation to the true surface. However, here we will ignore this discrepancy, and operate as if the linear model were correct
Even if we knew \(f(X)\), that is, even if we knew the true values of \(B_0, B_1,...,B_p\), the response value cannot be predicted perfectly because of the random error (\(\epsilon\)) in the model. We refer to this as the irreducible error
How much will \(Y\) vary from \(\hat{y}\)?
We use prediction intervals to answer this question
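As a sketch on the advertising data (new_obs is a hypothetical set of budgets chosen only for illustration), predict() returns both types of interval; the prediction interval is wider because it also accounts for the irreducible error:
# sketch: confidence vs. prediction intervals for a new observation
adv <- readr::read_csv("Advertising.csv")
fit <- lm(sales ~ TV + radio + newspaper, data = adv)

new_obs <- data.frame(TV = 100, radio = 20, newspaper = 30)  # hypothetical budgets (in $1,000s)

predict(fit, newdata = new_obs, interval = "confidence")  # uncertainty in the estimated f(X)
predict(fit, newdata = new_obs, interval = "prediction")  # additionally includes epsilon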
Qualitative Predictors
Some predictors are not quantitative but are qualitative, taking a discrete set of values
These are also called categorical predictors or factor variables
Example:
Investigate differences in credit card balance between males and females, ignoring the other variables. We create a new variable
\(x_i = \begin{cases} 1 & \text{if the } i\text{th person is female} \\ 0 & \text{if the } i\text{th person is male} \end{cases}\)
Which results in the following model:
\(y_i = B_0 + B_1x_i + \epsilon_i\)
For females: \(y_i = B_0 + B_1 + \epsilon_i\)
For males: \(y_i = B_0 + \epsilon_i\)
The average credit card balance for males is estimated to be $509.80
Whereas females are estimated to carry $19.73 in additional credit card balance for a total of $509.80 + $19.73 = $529.53
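A sketch with made-up numbers (this is not the actual Credit data; the toy data frame and its means are invented purely to show the mechanics): when the predictor is a factor, lm() builds the 0/1 dummy variable automatically, the intercept is the mean of the baseline level, and the coefficient is the difference in means:
# sketch: dummy-variable coding with a factor predictor (synthetic data)
set.seed(1)
toy <- data.frame(
  gender  = factor(rep(c("Male", "Female"), each = 50)),
  balance = c(rnorm(50, mean = 510, sd = 100), rnorm(50, mean = 530, sd = 100))
)
toy$gender <- relevel(toy$gender, ref = "Male")   # make Male the baseline level

fit <- lm(balance ~ gender, data = toy)
coef(fit)                  # intercept = mean balance for males; coefficient = female - male difference
model.matrix(fit)[1:3, ]   # the 0/1 coding that lm() used internally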
The standard linear regression model provides interpretable results and works quite well on many real-world problems
However, it makes several highly restrictive assumptions that are often violated in practice
The additivity assumption means that the association between a predictor \(X_j\) and the response \(Y\) does not depend on the values of the other predictors
The linearity assumption states that the change in \(Y\) associated with a one-unit change in \(X_j\) is constant, regardless of the value of \(X_j\)
Interactions and Nonlinearity
Interactions:
In our previous analysis of the Advertising data, we assumed that the effect on sales of increasing one advertising medium is independent of the amount spent on the other media
For example, the linear model: \(\hat{sales} = B_0 + B_1 \times TV + B_2 \times radio + B_3 \times newspaper\)
states that the average effect on sales of a one-unit increase in TV is always \(B_1\), regardless of the amount spent on radio
But suppose that spending money on radio advertising actually increases the effectiveness of TV advertising, so that the slope term for TV should increase as radio increases
In this situation, given a fixed budget of $100,000, spending half on radio and half on TV may increase sales more than allocating the entire amount to either TV or to radio
In marketing, this is known as a synergy effect, and in statistics it is referred to as an interaction effect
# TV*radio expands to TV + radio + TV:radio, i.e., both main effects plus the interaction
interact <- lm(sales ~ TV*radio, data = data)
summary(interact)
The results in this table suggest that interactions are important
The p-value for the interaction term \(TV \times radio\) is extremely low, indicating that there is strong evidence for \(H_A:B_3 \neq 0\)
The \(R^2\) for the interaction model is 96.8% compared to only 89.7% for the model that predicts sales using TV and radio without an interaction term
This means that (96.8 − 89.7)/(100 − 89.7) = 69% of the variability in sales that remains after fitting the additive model has been explained by the interaction term
The coefficient estimates in the table suggest that an increase in TV advertising of 1,000 dollars is associated with increased sales of \((\hat{B_1}+\hat{B_3}\times radio) \times 1000 = 19 + 1.1 \times radio\) units
The coefficient estimates in the table suggest that an increase in radio advertising of 1,000 dollars is associated with increased sales of \((\hat{B_2}+\hat{B_3}\times TV) \times 1000 = 29 + 1.1 \times TV\) units
Sometimes it is the case that an interaction term has a very small p-value, but the associated main effects (in this case, TV and radio) do not
The hierarchy principle:
If we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant
The rationale for this principle is that interactions are hard to interpret in a model without main effects — their meaning is changed
Specifically, the interaction terms also contain main effects, if the model has no main effect terms
1). Non-linearity of the response-predictor relationships
The linear regression model assumes that there is a straight-line relationship between the predictors and the response
If the true relationship is far from linear, then virtually all of the conclusions that we draw from the fit are suspect
Residual plots are a useful tool for identifying non-linearity
MLR2 <- lm(sales ~ TV + radio + newspaper, data = data)
summary(MLR2)

par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2x2 grid
plot(MLR2)             # base R residual diagnostics for the fitted model
# ggplot version of the residuals-vs-fitted plot
datamod <- data %>%
  mutate(modfitted = predict(MLR2),
         modresiduals = residuals(MLR2))

ggplot(data = datamod, mapping = aes(x = modfitted, y = modresiduals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  geom_smooth(se = FALSE) +
  scale_y_continuous(breaks = seq(-9, 6, by = 3), limits = c(-9, 6)) +
  theme_bw() +
  theme(plot.background = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major = element_blank())
2). Correlation of error terms
An important assumption of the linear regression model is that the error terms are uncorrelated
The standard errors that are computed for the estimated regression coefficients or the fitted values are based on the assumption of uncorrelated error terms
If in fact there is correlation among the error terms, then the estimated standard errors will tend to underestimate the true standard errors. As a result, confidence and prediction intervals will be narrower than they should be
In addition, p-values associated with the model will be lower than they should be; this could cause us to erroneously conclude that a parameter is statistically significant
3). Non-constant variance of error terms
Another important assumption of the linear regression model is that the error terms have a constant variance
The standard errors, confidence intervals, and hypothesis tests associated with the linear model rely upon this assumption
Heteroscedasticity
Will usually present itself as a funnel shape in the residual plot
The variances of the error terms may increase with the value of the response
When faced with this problem, one possible solution is to transform the response \(Y\) using a concave function such as \(\log(Y)\) or \(\sqrt{Y}\)
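A sketch of this remedy on the advertising data (sales are strictly positive, so the log is applicable; whether the transformation is actually needed depends on the residual pattern observed):
# sketch: refit with a log-transformed response and compare the residual plots
adv <- readr::read_csv("Advertising.csv")

fit_raw <- lm(sales ~ TV + radio, data = adv)
fit_log <- lm(log(sales) ~ TV + radio, data = adv)

par(mfrow = c(1, 2))
plot(fitted(fit_raw), residuals(fit_raw), main = "Response: sales")
plot(fitted(fit_log), residuals(fit_log), main = "Response: log(sales)")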
4). Outliers
An outlier is a point for which \(y_i\) is far from the value predicted by the model
Residual plots can be used to identify outliers
5). High-leverage points
Observations with high leverage have an unusual value for \(x_i\)
One way to quantify an observation's leverage is to compute the leverage statistic; a large value of this statistic indicates an observation with high leverage
\(h_i = \frac{1}{n}+\frac{(x_i - \bar{x})^2}{\sum_{i'=1}^{n}(x_{i'} - \bar{x})^2}\)
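In R, the leverage statistics for a fitted model are returned by hatvalues(); the threshold used below (three times the average leverage \((p + 1)/n\)) is just a common rule of thumb, not a fixed cutoff:
# sketch: leverage statistics for the multiple regression fit
adv <- readr::read_csv("Advertising.csv")
fit <- lm(sales ~ TV + radio + newspaper, data = adv)

h <- hatvalues(fit)                    # leverage statistic h_i for each observation
avg_leverage <- (3 + 1) / nrow(adv)    # average leverage = (p + 1) / n

which(h > 3 * avg_leverage)            # observations with unusually high leverage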
6). Collinearity
refers to the situation in which two or more predictor variables are closely related to one another
Multicollinearity: collinearity between three or more variables, which can exist even when no single pair of variables is especially highly correlated
Variance Inflation Factor (VIF)
The ratio of the variance of \(\hat{B_j}\) when fitting the full model to the variance of \(\hat{B_j}\) if fit on its own; the smallest possible value is 1, and larger values indicate more collinearity
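A sketch computing the VIFs directly from the equivalent form \(1 / (1 - R^2_{X_j | X_{-j}})\), where \(R^2_{X_j | X_{-j}}\) is the \(R^2\) from regressing \(X_j\) on the other predictors (the car package's vif() function reports the same values):
# sketch: variance inflation factors from their definition
adv <- readr::read_csv("Advertising.csv")

r2_tv    <- summary(lm(TV ~ radio + newspaper, data = adv))$r.squared
r2_radio <- summary(lm(radio ~ TV + newspaper, data = adv))$r.squared
r2_news  <- summary(lm(newspaper ~ TV + radio, data = adv))$r.squared

c(TV = 1 / (1 - r2_tv), radio = 1 / (1 - r2_radio), newspaper = 1 / (1 - r2_news))
# car::vif(lm(sales ~ TV + radio + newspaper, data = adv)) gives the same values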
Recall, linear regression is an example of a parametric approach because it assumes a linear functional form for \(f(X)\)
Parametric methods have several advantages; primarily, they are easy to fit
Parametric methods do have some disadvantages
they make strong assumptions about \(f(X)\)
Non-parametric methods do not explicitly assume a parametric form of \(f(X)\)
KNN (K-Nearest Neighbors Regression)
Given a value for \(K\) and a prediction point \(x_0\), KNN regression first identifies the \(K\) training observations that are closest to \(x_0\), represented by \(N_0\). It then estimates \(f(x_0)\) using the average of all the training responses in \(N_0\): \(\hat{f}(x_0) = \frac{1}{K}\sum_{x_i \in N_0}y_i\)
In general, the optimal value for \(K\) will depend on the bias-variance trade-off
A small value for \(K\) provides the most flexible fit, which will have low bias but high variance
A large value for \(K\) provides a smoother and less variable fit, however, the smoothing may cause bias by masking some of the structure in \(f(X)\)
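A minimal hand-rolled sketch of KNN regression with a single predictor and made-up data (packages such as FNN provide knn.reg() for real use; knn_reg below is just an illustrative helper that makes the averaging in the definition explicit):
# sketch: KNN regression implemented directly from the definition
knn_reg <- function(x_train, y_train, x0, K) {
  nbrs <- order(abs(x_train - x0))[1:K]   # indices of the K closest training observations
  mean(y_train[nbrs])                     # average the training responses in the neighborhood
}

set.seed(1)
x <- runif(50, 0, 10)
y <- sin(x) + rnorm(50, sd = 0.3)   # made-up nonlinear relationship

grid   <- seq(0, 10, by = 0.1)
fit_k1 <- sapply(grid, function(x0) knn_reg(x, y, x0, K = 1))
fit_k9 <- sapply(grid, function(x0) knn_reg(x, y, x0, K = 9))

plot(x, y)
lines(grid, fit_k1, col = "red")    # K = 1: very flexible, high variance
lines(grid, fit_k9, col = "blue")   # K = 9: smoother, less variable fit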
When will a parametric approach outperform a non-parametric approach?
The above provides an example of KNN regression on data with 50 observations
The true relationship is given by the black solid line
The blue curve corresponds to \(K = 1\) (on the left) and \(K = 9\) (on the right)
In this instance, the \(K=1\) predictions are far too variable, while the smoother \(K=9\) fit is much closer to \(f(X)\)
As a general rule, parametric methods will tend to outperform non-parametric approaches when there is a small number of observations per predictor
Curse of Dimensionality: as the number of predictors \(p\) grows, the training observations near any given test point \(x_0\) become sparse, so KNN and other local methods tend to perform poorly when \(p\) is large